Record Linkage with Uniqueness Constraints and Erroneous Values
Authors
Abstract
Many data-management applications require integrating data from a variety of sources, where different sources may refer to the same real-world entity in different ways and some may even provide erroneous data. An important task in this process is to recognize and merge the various references that refer to the same entity. In practice, some attributes satisfy a uniqueness constraint: each real-world entity (or most entities) has a unique value for the attribute (e.g., business contact phone, address, and email). Traditional techniques tackle this case by first linking records that are likely to refer to the same real-world entity, and then fusing the linked records and resolving any conflicts. Such methods can fall short for three reasons: first, erroneous values from sources may prevent correct linking; second, the real world may contain exceptions to the uniqueness constraints, so always enforcing uniqueness can miss correct values; third, locally resolving conflicts for linked records may overlook important global evidence. This paper proposes a novel technique to solve this problem. The key component of our solution is to reduce the problem to a k-partite graph clustering problem and to consider, during clustering, both the similarity of attribute values and the sources that associate a pair of values in the same record. Thus, we perform global linkage and fusion simultaneously, and can identify incorrect values and differentiate them from alternative representations of the correct value from the beginning. In addition, we extend our algorithm to tolerate a few violations of the uniqueness constraints. Experimental results show the accuracy and scalability of our technique.
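To make the k-partite reduction more concrete, the following is a minimal sketch of the graph construction only, not the paper's clustering algorithm. It assumes each source contributes flat records; every distinct (attribute, value) pair becomes a node, and two nodes are connected when some record contains both values, with the edge remembering which sources support it. The toy records, the attribute names, and the helper names `build_k_partite_graph` and `support` are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy input: (source, record) pairs. All values are invented.
RECORDS = [
    ("s1", {"name": "Acme Corp", "phone": "555-0100", "address": "1 Main St"}),
    ("s2", {"name": "Acme Corp", "phone": "555-0100", "address": "1 Main Street"}),
    ("s3", {"name": "Acme Corporation", "phone": "555-0199", "address": "1 Main St"}),
]


def build_k_partite_graph(records):
    """Map each edge to the set of sources that support it.

    A node is an (attribute, value) pair; an edge is a frozenset of two nodes.
    Nodes of the same attribute are never connected, because a record holds
    one value per attribute, so the graph is k-partite for k attributes.
    """
    edges = defaultdict(set)
    for source, record in records:
        for u, v in combinations(sorted(record.items()), 2):
            edges[frozenset((u, v))].add(source)
    return dict(edges)


def support(edges, u, v):
    """Number of sources that place values u and v in the same record."""
    return len(edges.get(frozenset((u, v)), set()))


graph = build_k_partite_graph(RECORDS)
for edge, sources in sorted(graph.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(edge), sorted(sources))
```

A clustering step in the spirit of the abstract would then combine this per-edge source support with a value-similarity measure (for example, treating "1 Main St" and "1 Main Street" as likely alternative representations), which this sketch leaves out.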
Journal title:
- PVLDB
Volume 3, Issue
Pages -
Publication date 2010